protein families and associated proteins, which can be found in ever new combinations of
domains. An overview of protein domain families can be found in the SMART database
(https://smart.embl-heidelberg.de). Alternatively, one can use the Conserved Domain
Database (CDD) at NCBI, which enables independent verification (Lu et al. 2020).
If one wants to look at the underlying genes, the “clusters of orthologous groups”
(COGs) first gave an overview starting from bacteria (https://www.ncbi.nlm.nih.gov/COG/).
Then eukaryotic gene groups were also considered (KOGs; eukaryotic orthologous
groups; https://mycocosm.jgi.doe.gov/help/kogbrowser.jsf). In this context, a cluster
of genes means that the same gene is found in very many organisms, so that the
same protein is required and encoded in very different organisms: each such gene is an
“ortholog” because the domain composition is the same. Eventually, these orthologous
groups were systematically extended and are called eggNOGs (Huerta-Cepas et al. 2017).
Excitingly, it can also be shown that the original richness of forms was much smaller, because the
primitive cell underlying all present-day life (the “LUCA”, last universal common
ancestor; Weiss et al. 2018), already using the same genetic code, had only about
1000–1500 proteins, which are still found today as highly conserved protein families in
virtually all organisms (similar, but not completely congruent, with the COGs). The protein
language is universal and only grew from a relatively manageable inventory to its current
richness over billions of years of evolution.
In order for everything to relate correctly to each other at the next higher level, the level
of protein networks, there is considerable biological redundancy and robustness. This is
necessary to ensure that every signal is correctly understood and does not get lost in the
noise (see Chap. 7):
Signals are further amplified in signal cascades. All this can be deciphered by network
analysis, which is a very efficient way of finding central proteins (hubs) that have a large
number of neighbours (e.g. network analysis with Cytoscape). The structure of the network
also reveals interfering signals as well as modifying and reciprocal input (cross-talk).
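The idea of a hub as a protein with many neighbours can be illustrated with a minimal sketch. It does not reproduce what Cytoscape does internally; it only counts distinct interaction partners (the node degree) in an edge list. The function name `find_hubs` and the proteins P1 to P6 are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def find_hubs(edges, top_n=3):
    """Return the top_n proteins with the most distinct interaction partners."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        neighbours[a].add(b)
        neighbours[b].add(a)
    # Sort by degree (descending), then by name so that ties are reproducible
    ranked = sorted(neighbours.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(protein, len(partners)) for protein, partners in ranked[:top_n]]


# Hypothetical toy interaction list (not taken from any database)
edges = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P1", "P5"),
    ("P2", "P3"), ("P4", "P6"),
]
hubs = find_hubs(edges, top_n=2)
print(hubs)
```

In this toy network P1 has four neighbours and is ranked first. In real interaction networks, degree is only one of several centrality measures, and the edge list would come from an interaction database rather than being typed in by hand.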
A fascinating and illustrative example can be found in the KEGG (Kyoto Encyclopedia
of Genes and Genomes) pathway database: the “maps of cancer pathways”, which
illustrate important stages of cancer (supporting and inhibiting pathways) and allow the
user to compare the pathway inventory of different organisms.
Building on these foundations, modeling pathways in cancer development and finding better
drugs against them is certainly a fascinating topic in bioinformatics (see Chap. 13).
Again, the contextuality of all molecules helps to systematically identify the promoting
and inhibiting pathways, for example, by gene expression analyses of healthy cells and
cancer cells (where almost all important observed changes in gene expression interact
to drive the cancer further).
Redundancy is also reflected in the fact that metabolic networks offer several alternative
synthesis pathways for important metabolites and for many others. This protects
against numerous genetic mutations that could otherwise disrupt the network, and it
also makes it possible to cope much better with fluctuations in the metabolites present in the
environment.
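The protective effect of alternative synthesis routes can be sketched in a few lines. The following is a strongly simplified model, assuming that metabolites are nodes and that each reaction is a single directed edge (real metabolic reconstructions have several substrates and products per reaction). The network `toy` and the function names are hypothetical. The code asks which single reaction, if lost through mutation, would cut off a product from its starting substrate.

```python
def reachable(graph, source, target, removed=frozenset()):
    """Depth-first search: can target be made from source without the removed reactions?"""
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in graph.get(node, ()):
            if (node, nxt) not in removed and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def essential_reactions(graph, source, target):
    """List the reactions whose single loss disconnects target from source."""
    reactions = [(a, b) for a, products in graph.items() for b in products]
    return [r for r in reactions if not reachable(graph, source, target, removed={r})]


# Hypothetical toy network: two routes from S to M, then one single step from M to X
toy = {"S": ["A", "B"], "A": ["M"], "B": ["M"], "M": ["X"]}

robust = essential_reactions(toy, "S", "M")   # M is made by two alternative routes
fragile = essential_reactions(toy, "S", "X")  # X depends on the single step M -> X
print(robust, fragile)
```

In this toy case no single lost reaction prevents the synthesis of M, whereas X is lost as soon as the step from M to X fails. This is the sense in which redundant routes buffer the network against individual mutations.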
12.2 Printing Errors Are Constantly Selected Away in the Cell